SQL Data Cleaning

In this project, I use SQL to clean a dataset so that it is ready for further analysis. Here is a step-by-step walkthrough of the project.

Creating a Database and Importing Data

I started by creating a new database and importing the data.
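
As a minimal sketch of this step, assuming SQLite driven from Python's built-in sqlite3 module (the database engine used in the project is not stated here), the snippet below creates a database, creates a table, and loads CSV rows into it. The table name housing, the column spellings, and the two sample rows are made up for illustration.

```python
import csv
import io
import sqlite3

# Stand-in for the real CSV export; these two rows are invented sample data.
raw_csv = io.StringIO(
    "UniqueID,ParcelID,PropertyAddress,SaleDate\n"
    '1,A-100,"12 OAK ST, SPRINGFIELD",2013-04-09 00:00:00\n'
    '2,A-101,"7 ELM AVE, RIVERTON",2014-06-10 00:00:00\n'
)

# ":memory:" keeps the database in RAM; a file path would create it on disk.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE housing ("
    " UniqueID INTEGER PRIMARY KEY,"
    " ParcelID TEXT,"
    " PropertyAddress TEXT,"
    " SaleDate TEXT)"
)

# csv.DictReader handles the quoted address field that contains a comma.
rows = [
    (int(r["UniqueID"]), r["ParcelID"], r["PropertyAddress"], r["SaleDate"])
    for r in csv.DictReader(raw_csv)
]
conn.executemany("INSERT INTO housing VALUES (?, ?, ?, ?)", rows)
conn.commit()

imported = conn.execute("SELECT COUNT(*) FROM housing").fetchone()[0]
print(imported)
```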

Data Overview

I then reviewed the data as a whole to understand it and to decide what needed cleaning, where, and how.
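
A first look of this kind might include a row count, a count of missing values, and the column list. The sketch below shows one way to get those, assuming SQLite via Python's sqlite3 module; the table name housing and the three sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE housing ("
    "UniqueID INTEGER, ParcelID TEXT, PropertyAddress TEXT, SoldAsVacant TEXT)"
)
# Invented sample rows, including one missing address.
conn.executemany("INSERT INTO housing VALUES (?, ?, ?, ?)", [
    (1, "A-100", "12 OAK ST, SPRINGFIELD", "No"),
    (2, "A-100", None, "N"),
    (3, "B-200", "7 ELM AVE, RIVERTON", "Yes"),
])

total_rows = conn.execute("SELECT COUNT(*) FROM housing").fetchone()[0]

# COUNT(column) skips NULLs, so the difference from COUNT(*) is the
# number of rows where PropertyAddress is missing.
missing_addresses = conn.execute(
    "SELECT COUNT(*) - COUNT(PropertyAddress) FROM housing"
).fetchone()[0]

# PRAGMA table_info returns one row per column: (cid, name, type, ...).
columns = [(row[1], row[2]) for row in conn.execute("PRAGMA table_info(housing)")]

print(total_rows, missing_addresses, columns)
```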

Date Standardization

I then standardized the date column. This involved:

Converting the SaleDate column from datetime format to date format in a query. This, however, does not alter the stored column.
Adding a new column, "sale_date_converted", to hold the converted sale date.
Updating the new sale_date_converted column with the converted values. We will later drop the SaleDate column.
Checking that the new column has been populated correctly.
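
The four steps above can be sketched as follows. This assumes SQLite, where date() strips the time part of a datetime string; other database engines use different conversion functions. The table name housing and the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE housing (UniqueID INTEGER, SaleDate TEXT)")
# Invented sample rows in datetime format.
conn.executemany("INSERT INTO housing VALUES (?, ?)", [
    (1, "2013-04-09 00:00:00"),
    (2, "2014-06-10 00:00:00"),
])

# Step 1: convert inside a SELECT only. The stored SaleDate is unchanged.
preview = conn.execute(
    "SELECT date(SaleDate) FROM housing ORDER BY UniqueID"
).fetchall()

# Step 2: add a new column to hold the converted value.
conn.execute("ALTER TABLE housing ADD COLUMN sale_date_converted TEXT")

# Step 3: populate the new column from the old one.
conn.execute("UPDATE housing SET sale_date_converted = date(SaleDate)")

# Step 4: compare old and new values side by side.
check = conn.execute(
    "SELECT SaleDate, sale_date_converted FROM housing ORDER BY UniqueID"
).fetchall()
print(check)
```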

Populating null/missing values in the Property Address column

I then populated the missing values in the Property Address column. I observed that each row with a missing address had a matching row with the same parcel ID but a different unique ID. Since rows that share a parcel ID should share a property address, I filled each missing address with the address from the matching row.
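
One way to express this is a self-join on the parcel ID to find the donor address, followed by an update. The sketch below assumes SQLite, where the update is written with a correlated subquery; the table name housing, the column spellings ParcelID and UniqueID, and the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE housing (UniqueID INTEGER, ParcelID TEXT, PropertyAddress TEXT)"
)
# Invented sample rows: row 2 shares a parcel ID with row 1 but has no address.
conn.executemany("INSERT INTO housing VALUES (?, ?, ?)", [
    (1, "A-100", "12 OAK ST, SPRINGFIELD"),
    (2, "A-100", None),
    (3, "B-200", "7 ELM AVE, RIVERTON"),
])

# Self-join: pair each row lacking an address with a different row
# that has the same parcel ID and a known address.
pairs = conn.execute("""
    SELECT a.UniqueID, b.PropertyAddress
    FROM housing AS a
    JOIN housing AS b
      ON a.ParcelID = b.ParcelID AND a.UniqueID <> b.UniqueID
    WHERE a.PropertyAddress IS NULL AND b.PropertyAddress IS NOT NULL
""").fetchall()

# Fill each missing address from a matching row. Rows without any
# match keep NULL, because the subquery then returns no value.
conn.execute("""
    UPDATE housing
    SET PropertyAddress = (
        SELECT other.PropertyAddress
        FROM housing AS other
        WHERE other.ParcelID = housing.ParcelID
          AND other.UniqueID <> housing.UniqueID
          AND other.PropertyAddress IS NOT NULL
        LIMIT 1
    )
    WHERE PropertyAddress IS NULL
""")

filled = conn.execute(
    "SELECT PropertyAddress FROM housing WHERE UniqueID = 2"
).fetchone()[0]
still_missing = conn.execute(
    "SELECT COUNT(*) FROM housing WHERE PropertyAddress IS NULL"
).fetchone()[0]
print(filled, still_missing)
```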

Splitting the Property Address Column

I then split the Property Address column, which holds both the street address and the city in a single field, into separate address and city columns.
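
A minimal sketch of such a split, assuming the address and city are separated by a comma and assuming SQLite's instr() and substr() functions. The new column names property_split_address and property_split_city, the table name housing, and the sample row are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE housing (UniqueID INTEGER, PropertyAddress TEXT)")
# Invented sample row with "street, city" in one field.
conn.execute("INSERT INTO housing VALUES (1, '12 OAK ST, SPRINGFIELD')")

conn.execute("ALTER TABLE housing ADD COLUMN property_split_address TEXT")
conn.execute("ALTER TABLE housing ADD COLUMN property_split_city TEXT")

# instr() gives the 1-based position of the comma. Everything before it
# is the street address; everything after it, trimmed, is the city.
conn.execute("""
    UPDATE housing
    SET property_split_address =
            substr(PropertyAddress, 1, instr(PropertyAddress, ',') - 1),
        property_split_city =
            trim(substr(PropertyAddress, instr(PropertyAddress, ',') + 1))
    WHERE instr(PropertyAddress, ',') > 0
""")

split_row = conn.execute(
    "SELECT property_split_address, property_split_city FROM housing"
).fetchone()
print(split_row)
```

The WHERE clause leaves rows without a comma untouched, so they can be inspected separately rather than split incorrectly.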

Splitting the Owner Address Column

I then split the Owner Address column into its component parts.
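
The sketch below illustrates one possible approach, assuming the owner address has three comma-separated parts (street, city, state); the actual layout is not stated here. SQLite has no built-in split function, so the example registers a small Python helper, split_part, as a SQL function. The helper, the new column names, the table name housing, and the sample rows are all hypothetical.

```python
import sqlite3


def split_part(text, delimiter, index):
    """Return the 1-based index-th piece of text, stripped, or None if absent."""
    if text is None:
        return None
    parts = [piece.strip() for piece in text.split(delimiter)]
    return parts[index - 1] if 0 < index <= len(parts) else None


conn = sqlite3.connect(":memory:")
# Make the helper callable from SQL as split_part(text, delimiter, index).
conn.create_function("split_part", 3, split_part)

conn.execute("CREATE TABLE housing (UniqueID INTEGER, OwnerAddress TEXT)")
# Invented sample rows, one of them with no owner address.
conn.executemany("INSERT INTO housing VALUES (?, ?)", [
    (1, "12 OAK ST, SPRINGFIELD, IL"),
    (2, None),
])

for column in ("owner_split_address", "owner_split_city", "owner_split_state"):
    conn.execute(f"ALTER TABLE housing ADD COLUMN {column} TEXT")

conn.execute("""
    UPDATE housing
    SET owner_split_address = split_part(OwnerAddress, ',', 1),
        owner_split_city    = split_part(OwnerAddress, ',', 2),
        owner_split_state   = split_part(OwnerAddress, ',', 3)
""")

owner_rows = conn.execute("""
    SELECT owner_split_address, owner_split_city, owner_split_state
    FROM housing ORDER BY UniqueID
""").fetchall()
print(owner_rows)
```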

SoldAsVacant column standardization

The next step was to standardize the SoldAsVacant column. I first listed the distinct values in the column together with their counts, which shows which spelling is the most common and therefore which one to standardize on. Since 'Yes' and 'No' have the highest counts, I converted 'Y' and 'N' to 'Yes' and 'No' respectively. This also makes the column easier to read, as 'Y' and 'N' on their own are open to misinterpretation.
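
As a sketch of this step, assuming SQLite, the counts can come from a GROUP BY query and the conversion from a CASE expression. The table name housing and the sample values are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE housing (UniqueID INTEGER, SoldAsVacant TEXT)")
# Invented sample values mixing the long and short spellings.
conn.executemany("INSERT INTO housing VALUES (?, ?)", [
    (1, "Yes"), (2, "No"), (3, "No"), (4, "Y"), (5, "N"), (6, "No"),
])


def value_counts(connection):
    """Map each distinct SoldAsVacant value to its number of rows."""
    return dict(connection.execute(
        "SELECT SoldAsVacant, COUNT(*) FROM housing GROUP BY SoldAsVacant"
    ))


before = value_counts(conn)

# Rewrite the short forms; every other value is kept as it is.
conn.execute("""
    UPDATE housing
    SET SoldAsVacant = CASE SoldAsVacant
        WHEN 'Y' THEN 'Yes'
        WHEN 'N' THEN 'No'
        ELSE SoldAsVacant
    END
""")

after = value_counts(conn)
print(before, after)
```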

Removing Duplicates

I then removed duplicate rows from the data. Deleting rows from the original data, however, is not standard practice and should be done with care. A safer alternative is to do this kind of manipulation on a working copy and keep the original data unchanged.
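
The sketch below follows that advice: it copies the table, then deletes duplicates from the copy only. It assumes SQLite and assumes that two rows are duplicates when they agree on parcel ID, property address, and sale date; the real duplicate criteria may differ. The table names housing and housing_work and the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE housing ("
    "UniqueID INTEGER, ParcelID TEXT, PropertyAddress TEXT, SaleDate TEXT)"
)
# Invented sample rows: rows 1 and 2 differ only in UniqueID.
conn.executemany("INSERT INTO housing VALUES (?, ?, ?, ?)", [
    (1, "A-100", "12 OAK ST, SPRINGFIELD", "2013-04-09"),
    (2, "A-100", "12 OAK ST, SPRINGFIELD", "2013-04-09"),
    (3, "B-200", "7 ELM AVE, RIVERTON", "2014-06-10"),
])

# Work on a copy so the original table stays intact.
conn.execute("CREATE TABLE housing_work AS SELECT * FROM housing")

# Within each group of identical rows, keep the lowest UniqueID
# and delete the rest.
conn.execute("""
    DELETE FROM housing_work
    WHERE UniqueID NOT IN (
        SELECT MIN(UniqueID)
        FROM housing_work
        GROUP BY ParcelID, PropertyAddress, SaleDate
    )
""")

kept_ids = [row[0] for row in conn.execute(
    "SELECT UniqueID FROM housing_work ORDER BY UniqueID"
)]
original_count = conn.execute("SELECT COUNT(*) FROM housing").fetchone()[0]
print(kept_ids, original_count)
```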

Dropping Unused Columns

Finally, I deleted the columns that were no longer needed: the Owner Address, TaxDistrict, PropertyAddress and SaleDate columns.
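
Many database engines can drop columns directly with ALTER TABLE ... DROP COLUMN. SQLite only gained that statement in version 3.35, so the sketch below uses a different, portable technique instead: it rebuilds the table with only the columns to keep. The table names, the kept columns, and the sample row are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE housing ("
    "UniqueID INTEGER, ParcelID TEXT, PropertyAddress TEXT, OwnerAddress TEXT, "
    "TaxDistrict TEXT, SaleDate TEXT, sale_date_converted TEXT)"
)
# Invented sample row.
conn.execute(
    "INSERT INTO housing VALUES (1, 'A-100', '12 OAK ST, SPRINGFIELD', "
    "'12 OAK ST, SPRINGFIELD, IL', 'DISTRICT 1', '2013-04-09 00:00:00', '2013-04-09')"
)

# Rebuild: copy the columns to keep into a new table, drop the old
# table, then give the new table the original name.
conn.executescript("""
    CREATE TABLE housing_clean AS
        SELECT UniqueID, ParcelID, sale_date_converted FROM housing;
    DROP TABLE housing;
    ALTER TABLE housing_clean RENAME TO housing;
""")

remaining = [row[1] for row in conn.execute("PRAGMA table_info(housing)")]
row_count = conn.execute("SELECT COUNT(*) FROM housing").fetchone()[0]
print(remaining, row_count)
```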

To download and view the full project on GitHub, click here.